Similarity-based data mining in files of two-dimensional chemical structures using fingerprint measures of molecular resemblance
نویسنده
چکیده
This paper reviews the use of measures of inter-molecular similarity for processing databases of chemical structures, which play an important role in the discovery of new drugs by the pharmaceutical industry. The similarity measures considered here are based on the use of a fingerprint representation of molecular structure, where a fingerprint is a vector encoding the presence of fragment substructures in a molecule and where the similarity between pairs of such fingerprints is computed using an association coefficient such as the Tanimoto coefficient. The Similar Property Principle provides the basic rationale for the use of similarity methods in three important chemoinformatics applications: similarity searching, database clustering, and molecular diversity analysis. Similarity searching enables the identification of those molecules in a database that are most similar to a userdefined, biologically active query molecule, with data fusion providing an effective way of combining the results of multiple similarity searches. Cluster analysis, typically using the Jarvis-Patrick, Ward or divisive k-means clustering methods, enables the cost-effective selection of molecules for biological testing, for property prediction and for investigating database overlap. Molecular diversity analysis, typically using cluster-based, dissimilarity-based or optimisation-based approaches, enables the identification of structurally diverse sets of molecules, so as to ensure that the full chemical space spanned by a database is tested in the search for novel bioactive molecules.
منابع مشابه
A Geometric View of Similarity Measures in Data Mining
The main objective of data mining is to acquire information from a set of data for prospect applications using a measure. The concerning issue is that one often has to deal with large scale data. Several dimensionality reduction techniques like various feature extraction methods have been developed to resolve the issue. However, the geometric view of the applied measure, as an additional consid...
متن کاملIdentification of BKCa channel openers by molecular field alignment and patent data-driven analysis
In this work, we present the first comprehensive molecular field analysis of patent structures on how the chemical structure of drugs impacts the biological binding. This task was formulated as searching for drug structures to reveal shared effects of substitutions across a common scaffold and the chemical features that may be responsible. We used the SureChEMBL patent database, which prov...
متن کاملTurbo similarity searching: Effect of fingerprint and dataset on virtual-screening performance
Turbo similarity searching uses information about the nearest neighbours in a conventional chemical similarity search to increase the effectiveness of virtual screening, with a data fusion approach being used to combine the nearest-neighbour information. A previous paper suggested that the approach was highly effective in operation; this paper further tests the approach using a range of differe...
متن کاملComparison of chemical clustering methods using graph- and fingerprint-based similarity measures.
This paper compares several published methods for clustering chemical structures, using both graph- and fingerprint-based similarity measures. The clusterings from each method were compared to determine the degree of cluster overlap. Each method was also evaluated on how well it grouped structures into clusters possessing a non-trivial substructural commonality. The methods which employ adjusta...
متن کاملComparison Of Chemical Clustering Methods Using Graph- Based And Fingerprint-Based Similarity Measures
This paper compares several published methods for clustering chemical structures, using both fingerprint-based and graph-based similarity measures. The clusterings from each method were compared to determine the degree of cluster overlap. Each method was also evaluated on how well it grouped structures into clusters possessing a non-trivial substructural commonality. The methods which employ ad...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Wiley Interdisc. Rew.: Data Mining and Knowledge Discovery
دوره 1 شماره
صفحات -
تاریخ انتشار 2011